Caching
Caching is one mechanism to speed up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference between them is that cache() stores the RDD in memory only, whereas persist(level) can store it in memory, on disk, or in off-heap memory according to the storage strategy specified by level. persist() without an argument is equivalent to cache().
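As a sketch (assuming an existing SparkContext named `sc` and placeholder file paths), the two calls differ only in the storage level they request:

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder input path; assumes a running SparkContext `sc`.
val lines = sc.textFile("hdfs:///data/input.txt")

// cache() stores the RDD deserialized in memory only;
// it is shorthand for persist(StorageLevel.MEMORY_ONLY).
lines.cache()

// persist(level) lets you choose the strategy explicitly,
// e.g. spill partitions to disk when they do not fit in memory:
val words = lines.flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_AND_DISK)
```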
Caching is lazy
- RDD.cache is also a lazy operation.
- If you run textFile.count the first time, the file will be loaded, cached, and counted.
- If you call textFile.count a second time, the operation will use the cache.
- It will just take the data from the cache and count the lines.
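The lazy behaviour above can be illustrated as follows (a sketch; the variable name and path are placeholders):

```scala
val textFile = sc.textFile("hdfs:///data/log.txt")
textFile.cache()           // lazy: nothing is read or cached yet

val n1 = textFile.count()  // first action: loads the file, caches it, counts
val n2 = textFile.count()  // second action: counts straight from the cache
```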
Checkpointing
- Checkpointing stores the RDD physically to HDFS and destroys the lineage that created it.
- The checkpoint files are not deleted even after the Spark application has terminated.
- Checkpoint files can be reused by subsequent job runs or driver programs.
- Checkpointing an RDD causes double computation: the RDD is computed once for the action that triggers the job and again when it is written to the checkpoint directory, so it is common to cache() the RDD before checkpointing it.
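A minimal checkpointing sketch (the checkpoint directory and input path are placeholders); calling cache() first avoids recomputing the RDD when the checkpoint is written:

```scala
sc.setCheckpointDir("hdfs:///checkpoints")   // must be set before checkpoint()

val rdd = sc.textFile("hdfs:///data/events.txt")
            .map(_.toUpperCase)

rdd.cache()        // avoids recomputation when the checkpoint job runs
rdd.checkpoint()   // marks the RDD; the write happens after the next action

rdd.count()        // triggers the job and materializes the checkpoint;
                   // afterwards the RDD's lineage is truncated to the checkpoint
```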